Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MONGOCRYPT-755 Implement StrEncode #928

Open
wants to merge 25 commits into
base: master
Choose a base branch
from

Conversation

marksg07
Copy link
Collaborator

No description provided.

@marksg07 marksg07 marked this pull request as ready for review December 31, 2024 20:07
src/mc-text-search-str-encode-private.h Show resolved Hide resolved
src/mc-text-search-str-encode.c Outdated Show resolved Hide resolved
src/mc-text-search-str-encode.c Outdated Show resolved Hide resolved
src/mc-text-search-str-encode.c Outdated Show resolved Hide resolved
src/mc-text-search-str-encode.c Show resolved Hide resolved
src/mc-text-search-str-encode.c Show resolved Hide resolved
@marksg07 marksg07 marked this pull request as draft January 6, 2025 16:02
@marksg07 marksg07 requested a review from erwee January 6, 2025 21:36
@marksg07 marksg07 marked this pull request as ready for review January 7, 2025 16:20
src/mc-text-search-str-encode.c Outdated Show resolved Hide resolved
src/mc-text-search-str-encode-private.h Outdated Show resolved Hide resolved
src/mc-text-search-str-encode-private.h Outdated Show resolved Hide resolved
src/mc-text-search-str-encode.c Outdated Show resolved Hide resolved
src/mc-text-search-str-encode.c Show resolved Hide resolved
src/mc-text-search-str-encode.c Outdated Show resolved Hide resolved
src/mc-text-search-str-encode.c Outdated Show resolved Hide resolved
@marksg07 marksg07 requested a review from erwee January 13, 2025 21:18
src/mc-str-encode-string-sets-private.h Outdated Show resolved Hide resolved
src/mc-str-encode-string-sets.c Show resolved Hide resolved
src/mc-str-encode-string-sets.c Outdated Show resolved Hide resolved
src/mc-str-encode-string-sets.c Outdated Show resolved Hide resolved
src/mc-str-encode-string-sets.c Outdated Show resolved Hide resolved
src/mc-str-encode-string-sets.c Show resolved Hide resolved
src/mc-str-encode-string-sets.c Outdated Show resolved Hide resolved
src/mc-text-search-str-encode.c Show resolved Hide resolved
@marksg07 marksg07 requested a review from erwee January 15, 2025 21:59
Copy link
Contributor

@kevinAlbs kevinAlbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. Only substantial comment is to limit string lengths to prevent possible overflows.

Comment on lines 47 to 52
fprintf(stderr,
"Testing nofold suffix/prefix case: str=\"%s\", lb=%u, ub=%u, unfolded_codepoint_len=%u\n",
str,
lb,
ub,
unfolded_codepoint_len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest using the TEST_PRINTF and TEST_STDERR_PRINTF macros to flush stdout/stderr and avoid mixed output in Evergreen logs. The macros were recently introduced in b193dba. Run git merge master to include them.

    TEST_STDERR_PRINTF("Testing nofold suffix/prefix case: str=\"%s\", lb=%u, ub=%u, unfolded_codepoint_len=%u\n",
                       str,
                       lb,
                       ub,
                       unfolded_codepoint_len);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to TEST_PRINTF's, thanks for this note.

Comment on lines 26 to 27
#undef MIN
#define MIN(a, b) (((a) < (b)) ? (a) : (b))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#undef MIN
#define MIN(a, b) (((a) < (b)) ? (a) : (b))

Suggest using BSON_MIN to simplify.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, totally forgot to do this in the test; changed it in the source file due to Erwin's comment earlier. Done.

@@ -119,6 +119,58 @@ bool mc_FLE2RangeInsertSpec_parse(mc_FLE2RangeInsertSpec_t *out,
bool use_range_v2,
mongocrypt_status_t *status);

typedef struct {
// mlen is the max string length that can be indexed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// mlen is the max string length that can be indexed.
// mlen is the max string length (in characters, not bytes) that can be indexed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a clarifying comment above.

Comment on lines 65 to 66
mc_FLE2TextSearchInsertSpec_t spec =
{str, byte_len, {{0, 0, 0}, false}, {{lb, ub}, true}, {{0, 0}, false}, false, false};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mc_FLE2TextSearchInsertSpec_t spec =
{str, byte_len, {{0, 0, 0}, false}, {{lb, ub}, true}, {{0, 0}, false}, false, false};
mc_FLE2TextSearchInsertSpec_t spec = {.v = str, .len = byte_len, .suffix = {{lb, ub}, true}};

Suggest using designated initializers and omitting fields that are expected to be zero-initialized to improve readability.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks for the feedback.

uint32_t affix_count = 0;
uint32_t total_real_affix_count = 0;
while (mc_affix_set_iter_next(&it, &affix, &affix_len, &affix_count)) {
// Since all substrings are just views on the base string, we can use pointer math to find our start and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Since all substrings are just views on the base string, we can use pointer math to find our start and
// Since all substrings are just views on the base string, we can use pointer math to find our start and end

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

ASSERT(sets->exact.len == byte_len);
ASSERT(0 == memcmp(sets->exact.data, str, byte_len));

if (unfolded_codepoint_len > mlen || lb > max_padded_len) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (unfolded_codepoint_len > mlen || lb > max_padded_len) {
if (lb > max_padded_len) {

Redundant with above check.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

}
set->start_indices[idx] = base_start_idx;
set->end_indices[idx] = base_end_idx;
set->substring_counts[idx] = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider storing and incrementing the current set size in mc_affix_set_t, rather than requiring callers to track the index:

set->start_indices[set->cur_idx] = base_start_idx;
set->end_indices[set->cur_idx] = base_end_idx;
set->substring_counts[set->cur_idx] = 1;
set->cur_idx++;

That may help avoid exposing implementation details of mc_affix_set_t to the caller.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed, thanks for the suggestion.

it->cur_idx = 0;
}

bool mc_affix_set_iter_next(mc_affix_set_iter_t *it, const char **str, uint32_t *len, uint32_t *count) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bool mc_affix_set_iter_next(mc_affix_set_iter_t *it, const char **str, uint32_t *len, uint32_t *count) {
bool mc_affix_set_iter_next(mc_affix_set_iter_t *it, const char **str, uint32_t *byte_len, uint32_t *count) {

To clarify output is byte length, not character length.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

// Linked list node in the hashset.
typedef struct _mc_substring_set_node_t {
uint32_t start_offset;
uint32_t len;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uint32_t len;
uint32_t byte_len;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

mc_utf8_string_with_bad_char_t *mc_utf8_string_with_bad_char_from_buffer(const char *buf, uint32_t len) {
BSON_ASSERT_PARAM(buf);
mc_utf8_string_with_bad_char_t *ret = bson_malloc0(sizeof(mc_utf8_string_with_bad_char_t));
_mongocrypt_buffer_init_size(&ret->buf, len + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect this could overflow if len is UINT32_MAX. Similarly, the CBC length calculations may overflow when adding 15.

I suggest rejecting too-long strings in mc_text_search_str_encode, and using a limit much smaller than UINT32_MAX. If the limit is near UINT32_MAX, these operations may be prohibitively slow to be useful and could risk a denial-of-service attack. Consider using 16MiB (16777216 bytes) to match the maximum insert size of a BSON document (maxBsonObjectSize from the hello command reply).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was originally thinking that we would perform this check at a higher level, but I think now I agree with you that it makes sense to do the checks here, nearer to where the actual algorithm is taking place. One thing to note is that for the substring case, even getting close to 16MiB is going to be way too big. Figuring out the max values for all the parameters that go in is still TBD for later in the project, so I'll put the limit at 16MiB for now and we can think more about this later.

@marksg07 marksg07 requested a review from kevinAlbs January 21, 2025 22:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants